<html>
<head>
	<meta charset="utf-8">

  <title>Apache Spark Performance</title>
	 <!-- Bootstrap -->
    <link href="css/bootstrap.min.css" rel="stylesheet">
    <link href="css/index2.css" rel="stylesheet">
 
	
    <script src="js/jquery-1.11.2.min.js"></script>
	<script src="js/bootstrap.min.js"></script>
	<script src="js/index2.js"></script>

</head>
<body>
	<div id="framework">
		<div id="divlogo">
		<h1>Apache Spark Performance</h1>
		</div>
		
		<nav class="navbar navbar-inverse" role="navigation" style="border-radius:0px">
			<div id="divNavBar" class="container">
      			<ul class="nav navbar-nav">
        			<li><a href="index.html" onclick="SwitchResultList(false)">Home</a></li>
        			<li><a href="result.html" onclick="SwitchResultList(true)" >Result</a></li>
         			<li><a href="about.html" onclick="SwitchResultList(false)">Documentation</a></li>
      			</ul>
	 		</div>
		</nav>

		<div id="divContainer">
			<div id="divContainer-left">

			<ul id="ResultList" class="list-group" style="display:none">
			  <li class="list-group-item"><a href="plaf1.html">Plaf A</a></li>
			  <li class="list-group-item"><a href="plaf2.html">Plaf B</a></li>
			  <li class="list-group-item"><a href="plaf3.html">Plaf C</a></li>	
			  <li class="list-group-item"><a href="plaf4.html">Plaf D</a></li>
			</ul>	


			<ul  class="list-group">
			  <li class="list-group-item"><a href="index.html" onclick="SwitchResultList(false)">Home</a></li>
			  <li class="list-group-item"><a href="result.html" onclick="SwitchResultList(true)" >Result</a></li>
			  <li class="list-group-item"><a href="about.html" onclick="SwitchResultList(false)" >Documentation</a></li> 
			</ul>
			
			<!--Anchor-->
			<ul id="anchors" class="list-group" style="position:fixed;top:200px;right:300px;display:none">
			  <li class="list-group-item"><a href="#time" >Time</a></li>
			  <li class="list-group-item"><a href="#release" >Release</a></li>
			</ul>

			</div>
			<div id="divContainer-right">


<h1>Apache Spark Performance</h1>
<h2>ABOUT</h2>
<p>Our goal is to work with the Apache Spark community to further improve the performance of Apache Spark. The data on this site lets community members closely track performance gains and losses with each weekly version of Apache Spark. Ultimately, we hope that this data will lead to consistent performance increases.</p>
<h3>SparkBench</h3>
<a href="https://github.com/Intel-bigdata/Sparkbench">https://github.com/Intel-bigdata/Sparkbench</a>

<h4>Spark Core:</h4>
<p>1. Sort (sort)<br>
This workload sorts its text input data, which is generated using RandomTextWriter.</p>

<p>2. WordCount (wordcount)<br>
This workload counts the occurrences of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.</p>
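<p>As a rough illustration of what the WordCount workload computes, the same split-and-count pattern can be sketched in plain Python. This is not the SparkBench implementation; the count_words helper and the sample lines are hypothetical stand-ins for the RandomTextWriter input.</p>

```python
from collections import Counter


def count_words(lines):
    """Count how often each whitespace-separated word occurs across all lines."""
    counts = Counter()
    for line in lines:
        # "Map" step: split the line into words.
        # "Reduce" step: add each word's occurrences to the running totals.
        counts.update(line.split())
    return counts


# Hypothetical input standing in for RandomTextWriter output.
sample = ["spark sort spark", "sort wordcount spark"]
word_counts = count_words(sample)
print(word_counts.most_common())
```

<p>The result is much smaller than the input, which is why this class of job is described as extracting a small amount of data from a large data set.</p>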

<p>3. Sleep (sleep)<br>
Each task in this workload sleeps for a set number of seconds, which exercises the task scheduler.</p>

<h4>Spark SQL:</h4>
<p>4. Scan (scan)<br>
Measures the throughput of the Spark SQL cluster by querying a large table and writing the result back to HDFS:<br>
FROM rankings SELECT *</p>

<p>5. Join (join)<br>
Joins two large tables using WHERE conditions together with GROUP BY and ORDER BY operations:<br>
SELECT sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank) FROM rankings R JOIN (SELECT sourceIP, destURL, adRevenue FROM uservisits UV WHERE (datediff(UV.visitDate, '1999-01-01')&gt;=0 AND datediff(UV.visitDate, '2000-01-01')&lt;=0)) NUV ON (R.pageURL = NUV.destURL) group by sourceIP order by totalRevenue DESC limit 1</p>

<p>6. Aggregate (aggregation)<br>
Queries a large table with SUM and GROUP BY operations:<br>
SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP</p>
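<p>To see what the aggregation query returns, it can be tried on a tiny in-memory table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL; the two-column table layout and the sample rows are assumptions made for illustration only.</p>

```python
import sqlite3

# SQLite stands in for Spark SQL here; only the two columns the query needs are modeled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, adRevenue REAL)")
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    [("10.0.0.1", 1.5), ("10.0.0.2", 2.0), ("10.0.0.1", 0.5)],  # made-up rows
)

# The benchmark query, unchanged.
query = "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP"

# GROUP BY does not guarantee an output order, so sort in Python for stable display.
rows = sorted(conn.execute(query).fetchall())
for source_ip, total_revenue in rows:
    print(source_ip, total_revenue)
conn.close()
```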

<h4>Spark GraphX:</h4>
<p>7. PageRank (pagerank)<br>
This workload benchmarks the PageRank algorithm implemented in the Spark-MLlib examples. The data source is generated from Web data whose hyperlinks follow a Zipfian distribution.</p>
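<p>For readers unfamiliar with PageRank, the following is a minimal plain-Python sketch of the textbook iterative formulation, in which ranks are normalized to sum to 1. It is not the code the benchmark runs, and the Spark example may use a different variant; the pagerank function, the damping factor of 0.85, and the three-page graph are all assumptions chosen for illustration.</p>

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```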

<h4>Spark MLlib:</h4>
<p>8. Bayesian Classification (bayes)<br>
This workload benchmarks the Naive Bayesian classification implemented in the Spark-MLlib examples.<br>
Large-scale machine learning is another important use of MapReduce. This workload tests the Naive Bayesian trainer in Mahout 0.7, an open source Apache machine learning library; Naive Bayes is a popular classification algorithm for knowledge discovery and data mining. The workload uses automatically generated documents whose words follow a Zipfian distribution. The dictionary used for text generation comes from the default Linux file /usr/share/dict/linux.words.</p>

<p>9. K-means clustering (kmeans)<br>
This workload tests the K-means clustering in Mahout 0.7; K-means is a well-known clustering algorithm for knowledge discovery and data mining. The input data set is generated by GenKMeansDataset based on a uniform distribution and a Gaussian distribution.</p>
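<p>The text above does not say how GenKMeansDataset combines the two distributions. One plausible reading, shown here purely as an assumption, is that cluster centers are drawn uniformly and the points of each cluster are drawn from a Gaussian around its center. The generate_points function and all of its parameters are hypothetical.</p>

```python
import random


def generate_points(num_clusters, points_per_cluster, dim=2, spread=0.5, seed=42):
    """Draw uniform cluster centers, then Gaussian points around each center.

    Assumption: this split of roles between the two distributions is a guess,
    not the documented behavior of GenKMeansDataset.
    """
    rng = random.Random(seed)
    # Centers: each coordinate uniform in [0, 10].
    centers = [
        [rng.uniform(0.0, 10.0) for _ in range(dim)] for _ in range(num_clusters)
    ]
    points = []
    for center in centers:
        for _ in range(points_per_cluster):
            # Points: each coordinate Gaussian around the center's coordinate.
            points.append([rng.gauss(coord, spread) for coord in center])
    return centers, points


centers, points = generate_points(num_clusters=3, points_per_cluster=4)
print(len(centers), "centers,", len(points), "points")
```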

<h3>Spark-perf</h3>
<p><a href="https://github.com/databricks/spark-perf">https://github.com/databricks/spark-perf</a></p>

<p>Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.<br>
Parameterized test configurations:<br>
Sweeps sets of parameters to test against multiple Spark and test configurations.<br>
Automatically downloads and builds Spark:<br>
Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.</p>
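<p>The idea of sweeping sets of parameters can be sketched as expanding a grid into every combination of values. The snippet below is a generic illustration in plain Python, not spark-perf's actual configuration format; the parameter names, the values, and the sweep function are hypothetical.</p>

```python
from itertools import product

# Hypothetical parameter sets; none of these names come from spark-perf itself.
param_grid = {
    "num_partitions": [8, 64],
    "data_size_gb": [1, 10],
    "spark_version": ["version-a", "version-b"],
}


def sweep(grid):
    """Expand a dict of value lists into one dict per combination of values."""
    keys = list(grid)
    return [
        dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))
    ]


configs = sweep(param_grid)
for config in configs:
    print(config)
```

<p>With two values for each of three parameters, the sweep yields eight test configurations, one per combination.</p>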

	</div>
		<hr style="clear:both" />
		<div id="footer">
		
			Copyright &copy; 2015 Intel Corporation. All rights reserved.
		<em>*Other names and brands may be claimed as the property of others.</em>
		
		</div>

	</div>
</body>

</html>
